Introduction

Our team sought to answer an intriguing question: can we predict the political party of Twitter users from the words they tweet? After some discussion, we narrowed this question down to inference on current members of Congress. To this end, we used the Twitter API to gather the past year’s tweets from all Senate and House members. Taking a random sample of tweets, we distilled this mass of text into per-user word densities: the proportion of each user’s words devoted to a given word, expressed as a ratio relative to the user who used that word most heavily.

With this data we applied unsupervised learning techniques, namely principal component analysis (PCA) and clustering, as well as supervised techniques like logistic regression and random forests. Our aim in applying these methods was inference: by building models with better predictive capability, we gain sharper insight into the structure of the data and the language associated with each political party.

Note: in general, the data transforms take a while to run, so we pre-load the transformed data and run only the code that is necessary.
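The pre-loading pattern looks roughly like the sketch below; the file and script names match those described in “The Data”, but the exact paths and the recompute branch are assumptions, not our literal setup code.

```r
# Minimal sketch of the caching pattern, assuming full_data.RData and
# tidy_text.R live in the working directory.
if (file.exists("full_data.RData")) {
  load("full_data.RData")   # restores the pre-computed full_data object
} else {
  source("tidy_text.R")     # slow: rebuilds the word-density matrix from scratch
}
```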

The Data

All files referenced in this section are in the DataCollection folder. Our sources for data collection are the two files representatives.txt and senators.txt, taken from Justin Littman’s “115th U.S. Congress Tweet Ids” dataset (published by GWU Libraries on the Harvard Dataverse). These files contain the last 3,200 tweets from every member of the 115th Congress (the current session), excepting four members of the House who don’t have official Twitter accounts: Collin Peterson (D-MN-07), Lacy Clay (D-MO-01), Madeleine Bordallo (Guam delegate), and Gregorio Sablan (Northern Mariana Islands delegate). Each of these files is a list of tweet IDs, which uniquely identify tweet objects in the Twitter API. Metadata about how user accounts were identified is stored in the corresponding README files. Using the script get_twitter_data.py, we pulled down a random sample of 10,001 tweets from the House of Representatives (10001_house.zip) and 50,000 tweets from the Senate (50000_senate.zip).

Our second data set is legislators-current.csv, which contains (among other variables) the following information on all current members of Congress: name, state, chamber (House or Senate), district (if House), party, website, and social media account names. We use this data set to identify the political party of each Twitter account in the tweet data. Because this file comes from a different source than our Twitter data, and some politicians use multiple Twitter accounts (for example, @POTUS versus @realDonaldTrump), some manual cleaning was needed to ensure every account in the Twitter data set is present in the congress data set. The script add_congress_data.R “fills in” this information; most of the fixes were simply replacing usernames whose capitalization differed between the two sources.

Now that the two data sets match completely on Twitter username, we can transform the data into the form we want. The json_to_df.R script takes in the tweets as JSON files, extracts the information we’re interested in from each tweet, and builds a dataframe: each row is a tweet, and the columns are variables like tweet ID, timestamp, text, and author. The tidy_text.R script parses the content of the tweets, counts the occurrences of each word by user, scales each row and column, then joins the result with the congress_df dataset to make full_data.RData. Each row of this dataset is a user, each column is a word, and the entries are scaled proportions of how often a user used each word. For ease of computation, only words used by at least 10 distinct users were considered.
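As a rough sketch of that transform (not the exact code in tidy_text.R), the word counting and two-step scaling could be written with tidytext and dplyr as follows; `tweet_df`, with columns `user` and `text`, is a hypothetical stand-in for the output of json_to_df.R:

```r
library(dplyr)
library(tidyr)
library(tidytext)

word_density <- tweet_df %>%
  unnest_tokens(word, text) %>%              # one row per (tweet, word)
  count(user, word) %>%                      # occurrences of each word by user
  group_by(word) %>%
  filter(n_distinct(user) >= 10) %>%         # keep words used by 10+ distinct users
  group_by(user) %>%
  mutate(prop = n / sum(n)) %>%              # row scale: per-user word proportion
  group_by(word) %>%
  mutate(scaled = prop / max(prop)) %>%      # column scale: ratio to the heaviest user
  ungroup() %>%
  select(user, word, scaled) %>%
  pivot_wider(names_from = word, values_from = scaled, values_fill = 0)
```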

Exploratory Data Analysis

In the file make_plots.R, we plot some basic results of the data.

The top plot shows how often members of each party use each word, on a log scale. For example, Republicans use the word “senate” about 0.6% of the time, while Democrats use it about 0.4% of the time. The red line represents equal usage between Democrats and Republicans. The bottom plot shows the log odds ratio log(Democrat usage / Republican usage) for the 15 words most heavily used by each party relative to the other. Not all words can be shown in the first plot, so let’s break them into a few categories.
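The log odds ratio in the bottom plot can be computed along these lines; `party_counts` (columns `party`, `word`, `n`, the total uses of each word by each party) is an assumed intermediate, not an object from our scripts:

```r
library(dplyr)
library(tidyr)

# Sketch of the log odds ratio statistic plotted above.
log_odds <- party_counts %>%
  group_by(party) %>%
  mutate(usage = n / sum(n)) %>%             # share of each party's total words
  ungroup() %>%
  select(party, word, usage) %>%
  pivot_wider(names_from = party, values_from = usage) %>%
  mutate(log_odds = log(Democrat / Republican))  # > 0 means heavier Democratic usage
```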

While some of these make intuitive sense (more Democrats tag other Democrats, and vice versa), one interesting note is that Democrats tag both @housegop and @senatedems more, and Republicans are more likely to tag @foxnews, @foxbusiness, and @aipac (the American Israel Public Affairs Committee, a pro-Israel lobbying group).

In the use of hashtags, we see some opposites between the two parties: #obamacare vs. #trumpcare, #passthebill vs. #killthebill (in regards to the tax reform bill), #marchforlife and #pro_life vs. #istandwithpp. Some other perceived talking points of the two parties emerge: the Iran nuclear deal and the Keystone XL pipeline for Republicans, and climate change and the Trump-Russia investigation for Democrats.

The “regular words” (not hashtags or tagged users) include a few more potentially uninteresting terms (such as “morning”), but a few patterns still stand out.

Modeling

PCA

It turns out that visualizing a data set with 4,345 variables is tricky, to say the least. To get around this, we applied PCA to identify the directions that account for most of the variation in the data.

d <- full_data[,-(1:2)]
pca1 <- prcomp(d)
pc_df <- data.frame(PC = 1:20,
                    PVE = pca1$sdev[1:20]^2 / sum(pca1$sdev^2))  # divide by the total variance of ALL PCs
ggplot(pc_df, aes(x = PC, y = PVE)) +
  geom_line() + 
  geom_point()

From the scree plot, we can see that the first three PCs account for the vast majority of the structure in the data.

scores_df <- data.frame(user = full_data$twitter,
                         party = full_data$party_id,
                         PC1 = pca1$x[,1],
                         PC2 = pca1$x[,2],
                         PC3 = pca1$x[,3],
                         PC4 = pca1$x[,4]) %>%
  left_join(congress_df, by = c("user" = "twitter"))

loading_df <- data.frame(word = colnames(d), pca1$rotation[ ,1:4])

ggplot(scores_df, aes(x=PC1, y = party)) + geom_jitter()

The first principal component does a pretty good job of encoding what party the user belongs to, Democrat (-) or Republican (+).

ggplot(scores_df, aes(x=PC2, y = chamber_type)) + geom_jitter()

The second principal component appears to distinguish Representatives (+) from Senators (-).

kable(arrange(loading_df, desc(PC3))[c(1:10,4336:4345), ])
word PC1 PC2 PC3 PC4
1 families -0.1239068 0.0117368 0.2008988 0.0796565
2 #trumpcare -0.3533959 0.0060057 0.1374581 0.0205737
3 #paymoreforless -0.1578449 0.0385303 0.1311770 0.0086349
4 seniors -0.0913230 0.0124293 0.1084742 0.0452491
5 #veteransday 0.0270323 -0.0755258 0.0879955 -0.1819736
6 hurt -0.0749704 0.0211459 0.0856073 0.0204101
7 coverage -0.1592510 0.0102826 0.0851376 0.0453163
8 advances 0.0275588 -0.0726092 0.0801361 -0.1647117
9 tie 0.0242893 -0.0590068 0.0733977 -0.1526220
10 #trumpcares -0.0537211 0.0136856 0.0710925 0.0265834
4336 #trumprussia -0.0376904 -0.0177762 -0.0960597 -0.0442235
4337 investigate -0.0281191 -0.0817810 -0.0988678 0.0606355
4338 credible -0.0300072 -0.0098416 -0.1010107 -0.0293986
4339 chairman -0.0100436 -0.0009404 -0.1021721 -0.0521272
4340 conduct -0.0272213 -0.0198882 -0.1037505 -0.0148771
4341 investigation -0.0397529 -0.0429024 -0.1378704 -0.0358674
4342 independent -0.1316147 -0.0430895 -0.1391207 -0.0441689
4343 committee -0.0335960 -0.0242492 -0.1434073 -0.0752751
4344 russia -0.0367527 -0.0614352 -0.1638510 -0.0174157
4345 nunes -0.0746828 -0.0205165 -0.1785077 -0.0933248

The third principal component weights users differently based on whether they talk more about health care (+) or the Russia investigation (-).

We expected the first PC to be party, and we spent a while trying to figure out what the second PC could be (though in hindsight it makes sense that chamber shows up). The third, however, was the most surprising.

Below we’ve plotted a summary of the first two components, along with the non-text variables we think they best encode.

ggplot(scores_df,aes(x = PC1, y = PC2, color = party, shape = chamber_type)) +
  geom_point() +
  scale_color_manual(values=c("#619CFF","#00BA38","#F8766D")) +
  scale_shape_manual(values=c(1,16))

Clustering

km1 <- kmeans(d, centers = 1)
km2 <- kmeans(d, centers = 2, iter.max = 10, nstart = 20)
km3 <- kmeans(d, centers = 3, iter.max = 10, nstart = 20)
km4 <- kmeans(d, centers = 4, iter.max = 10, nstart = 20)
km5 <- kmeans(d, centers = 5, iter.max = 10, nstart = 20)  # nstart > 1 guards against
km6 <- kmeans(d, centers = 6, iter.max = 10, nstart = 20)  # poor random initializations
km7 <- kmeans(d, centers = 7, iter.max = 10, nstart = 20)

bub <- data.frame(ClusterNumber = 1:7,
                  tot.within.ss = c(km1$tot.withinss,
                                    km2$tot.withinss,
                                    km3$tot.withinss,
                                    km4$tot.withinss,
                                    km5$tot.withinss,
                                    km6$tot.withinss,
                                    km7$tot.withinss
                                    ))
ggplot(bub, aes(x = ClusterNumber, y = tot.within.ss)) +
  geom_line() +
  geom_point()

This is a scree plot for the number of clusters applied to the data. No clear elbow exists in the plot, implying that there is no strong clustering of the data. Below we plot some of these clusters on the first two principal components.

cluster_df <-data.frame(party = scores_df$party,
                        chamber = scores_df$chamber_type,
                        PC1 = pca1$x[,1],
                        PC2 = pca1$x[,2],
                        k2 = km2$cluster, k3=km3$cluster, k4=km4$cluster)
ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k2),shape=party)) +
  geom_point() +
  scale_shape_manual(values=c(1,17,16))

ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k3))) + geom_point()

ggplot(cluster_df, aes(x = PC1, y = PC2, color = as.factor(k4))) + geom_point()

In this analysis, 2 clusters separate the parties; 3 clusters group the entire Senate together and split the House by party; and 4 clusters add a mysterious fourth group (on some runs it splits the Senate by party, on others it sprinkles group 4 throughout; the result is quite variable). We can see how well the 2-clustering assigns party:

conf <- table(cluster_df$k2, cluster_df$party)
kable(conf)
Cluster Democrat Independent Republican
1 57 1 269
2 178 1 0

If we treat the 2-clustering as a “classification model”, it has a misclassification rate (MCR) of 0.1146245. Overall, the clustering agrees with our PCA: the most identifiable feature is party, followed by chamber.
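That figure can be derived roughly as follows: assign each cluster the majority party of its members, then score the implied labels against the truth. This is a sketch; the exact number reported depends on how the two Independents are counted.

```r
# Label each cluster with its majority party, then compute the MCR.
conf <- table(cluster_df$k2, cluster_df$party)
majority <- colnames(conf)[apply(conf, 1, which.max)]  # majority party per cluster
pred_party <- majority[cluster_df$k2]                  # predicted party per user
km_mcr <- mean(pred_party != as.character(cluster_df$party))
```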

Naive Model

naive_mcr <- mean(scores_df$party != "Republican")
kable(scores_df %>% group_by(party) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)))  # avoids hard-coding the 506 total
party n prop
Democrat 235 0.4644269
Independent 2 0.0039526
Republican 269 0.5316206

Our most naive model simply predicts the mode: every politician in the data set is a Republican. This gives a misclassification rate of 0.4683794. Any model that improves on this (not a hard task) will give us more insight into the data.

Logistic Models

Because (with two exceptions) we’re classifying into two parties, logistic regression is a natural choice. To fit the model, we remove the two Independent senators (Bernie Sanders, VT; and Angus King, ME) from our data set. Because of the exceedingly large number of predictors, a restricted model using the lasso or ridge penalty is appealing. We use 5-fold cross-validation to choose the penalty and guard against overfitting.

no_ind <- filter(full_data, party_id != "Independent") %>%
  mutate(party_id = factor(as.character(party_id)))

logit_ridge <- glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 0)
ridge_grid <- exp(seq(0, 5, length.out = 50))
ridge_cv <- cv.glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 0, nfolds = 5,
                      type.measure = "class", lambda = ridge_grid)
ridge_bestlam <- ridge_cv$lambda.min
ridge_pred <- predict(logit_ridge, s = ridge_bestlam,
                      newx = data.matrix(full_data[ ,-(1:2)]), type = "class")
ridge_mcr <- mean(ridge_pred != full_data$party_id)

logit_lasso <- glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 1)
lasso_grid <- exp(seq(-6, -2, length.out = 50))
lasso_cv <- cv.glmnet(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id,
                      family = "binomial", alpha = 1, nfolds = 5,
                      type.measure = "class", lambda = lasso_grid)
lasso_bestlam <- lasso_cv$lambda.min
lasso_pred <- predict(logit_lasso, s = lasso_bestlam,
                      newx = data.matrix(full_data[ ,-(1:2)]), type = "class")
lasso_mcr <- mean(lasso_pred != full_data$party_id)

plot(ridge_cv)

plot(lasso_cv)

Because both models perform better than \(\lambda = 0\) (ordinary logistic regression), we can feel confident choosing one of them over the full logistic model. On this dataset, our ridge MCR is 0.0375494 and our lasso MCR is 0.0039526. The lasso performing better makes intuitive sense, as we would expect many words to be meaningless for prediction. We can examine which coefficients were non-zero in the lasso model:

kable(data.frame(word = colnames(full_data)[-(1:2)],
                coeff = as.vector(predict(logit_lasso, s = lasso_bestlam, type = "coefficients"))[-1]) %>%
  filter(coeff !=0) %>%
  arrange(desc(coeff)))
word coeff
forward 0.2333976
obama 0.1789967
#obamacare 0.0812549
energy 0.0285472
savings -0.0052260
sj -0.0113303
corporations -0.0149638
deserve -0.0156128
bill -0.0302886
massive -0.0551805
bipartisan -0.0577781
critical -0.0673239
hate -0.1971708
ties -0.1992535
tear -0.2071270
predatory -0.2208062
environment -0.2324100
recreation -0.2368057
stem -0.2720241
demands -0.2759973
protect -0.2779761
facebook -0.3401555
transparent -0.3455557
background -0.3685572
civil -0.3800988
#usa -0.3821922
deal -0.4151496
prioritize -0.4274343
@sencortezmasto -0.4464951
discrimination -0.4782126
sad -0.4982090
farming -0.4989988
@timkaine -0.5070158
default -0.5304747
million -0.5305152
recuse -0.5458326
afford -0.6293077
#paymoreforless -0.6443399
@housegop -0.6542512
medicare -0.6599443
people -0.6661578
@gop -0.6766630
bannon -0.6868695
tomorrows -0.7152784
backwards -0.7180170
services -0.7305055
constitution -0.7397737
trumpcare -0.7580202
@senrobportman -0.7804709
nunes -0.8140484
aca -0.8315549
acres -0.8683048
sens -0.8809917
environmental -0.8810839
shutdown -0.8825484
resignation -0.9004789
pulling -0.9408735
average -0.9684261
scott -1.0238853
@senfranken -1.0275131
@epa -1.0299424
fortune -1.0425528
coverage -1.0486840
robotics -1.1021268
foreign -1.1889158
voices -1.2233117
gop -1.3639976
homes -1.3914316
#actonclimate -1.5335188
interference -1.5472966
extreme -1.5521337
dont -1.6551365
tuned -1.6667338
task -1.6878985
transgender -1.7027096
independent -1.7118178
blame -1.7548780
americans -1.7549179
voting -1.8344724
internet -1.8748075
pruitts -1.9214229
hour -1.9845166
#equalpayday -1.9969170
fargo -2.0660914
trumps -2.2601759
seniors -2.2677727
package -2.3519416
partisan -2.4587795
cut -2.7464459
dem -2.9027135
gops -3.0319127
#aca -3.0869864
oppose -3.2637105
overdose -3.2804956
pay -3.3137125
base -3.5131112
trump -4.1334327
unacceptable -4.1415088
#broadbandprivacy -4.5702820
#trumpcare -6.8340150

In this list we can see some of the same topics that we found through PCA, like health care and the Russia investigation, as well as net neutrality and the EPA/climate change. Judging by the magnitude of the coefficients on each side of zero, words typically used by Democrats (as determined by our exploratory data analysis) were more important in deciding which party a user belonged to.

We can also check which users were misclassified by the lasso and ridge models.

lasso_missed <- data.frame(twitter = full_data$twitter,
                     state = scores_df$state,
                     party = full_data$party_id,
                     pred = lasso_pred,
                     prob = as.vector(predict(logit_lasso, s = lasso_bestlam, newx=data.matrix(full_data[ ,-(1:2)]), type = "response")),
                     stringsAsFactors = FALSE) %>%
  filter(party != X1) %>%
  arrange(desc(prob))
kable(lasso_missed)
twitter state party X1 prob
SenAngusKing ME Independent Republican 0.8576747
SenSanders VT Independent Democrat 0.1172869
ridge_missed <- data.frame(twitter = full_data$twitter,
                     state = scores_df$state,
                     party = full_data$party_id,
                     pred = ridge_pred,
                     prob = as.vector(predict(logit_ridge, s = ridge_bestlam, newx=data.matrix(full_data[ ,-(1:2)]), type = "response")),
                     stringsAsFactors = FALSE) %>%
  filter(party != X1) %>%
  arrange(desc(prob))
kable(ridge_missed)
twitter state party X1 prob
repdavidscott GA Democrat Republican 0.5826187
SenAngusKing ME Independent Republican 0.5779478
RepGonzalez TX Democrat Republican 0.5712017
RepAlGreen TX Democrat Republican 0.5631460
RepSinema AZ Democrat Republican 0.5514052
AnthonyBrownMD4 MD Democrat Republican 0.5382995
RepBetoORourke TX Democrat Republican 0.5323220
RepJimCosta CA Democrat Republican 0.5312471
RepDerekKilmer WA Democrat Republican 0.5282319
Sen_JoeManchin WV Democrat Republican 0.5174060
SenatorTester MT Democrat Republican 0.5140082
RepOHalleran AZ Democrat Republican 0.5125567
SenDonnelly IN Democrat Republican 0.5113260
RepStephMurphy FL Democrat Republican 0.5090062
MarkWarner VA Democrat Republican 0.5085220
RepJoshG NJ Democrat Republican 0.5067471
SenatorHeitkamp ND Democrat Republican 0.5062373
RepTomSuozzi NY Democrat Republican 0.5004556
SenSanders VT Independent Democrat 0.2426629

Independents Angus King and Bernie Sanders both caucus with the Democrats, so we can consider Senator Sanders’ classification correct. Of note is that both of our logistic models only misclassified Democrats as Republicans! In addition, many of these congresspeople are Democratic legislators from majority Republican states like West Virginia, Texas, and Georgia.

Plotting these missed users among all the points, we see that most of them are Democrats grouped in the Republican cloud to the right. This seems to agree with our earlier statement that PC1 encodes party.

Boosted Tree

binary <- mutate(no_ind, party_id = ifelse(party_id == "Democrat", 0, 1))
boost_tweet <- gbm(party_id ~ . - twitter, data = binary,
                   distribution = "bernoulli",  # stated explicitly to avoid the
                   n.trees = 1000,              # "assuming bernoulli" warning
                   shrinkage = 0.03)
boost_pred <- predict(boost_tweet,
                      newdata = full_data,
                      n.trees = 1000,
                      type = "response") > .5
boost_pred <- ifelse(boost_pred, "Republican", "Democrat")
boost_mcr <- mean(boost_pred != full_data$party_id)
boost_mcr <- mean(boost_pred != full_data$party_id)

kable(head(summary(boost_tweet),20))
var rel.inf
#trumpcare #trumpcare 44.9478384
#aca #aca 5.3611618
coverage coverage 3.3904559
million million 2.2027373
trump trump 1.9996101
obamacare obamacare 1.8571873
protect protect 1.8022129
people people 1.7743480
seniors seniors 1.5678119
aca aca 1.5350070
oppose oppose 1.5235915
bill bill 1.4307878
unacceptable unacceptable 1.3958775
voting voting 1.3830423
gop gop 1.2531726
aisle aisle 1.0732624
introduced introduced 0.9968308
ties ties 0.9689715
#broadbandprivacy #broadbandprivacy 0.9048346
dont dont 0.7809643

Our boosted tree’s MCR is 0.013834. In the boosted tree, we really see #trumpcare stand out in variable importance, as well as themes like health care and net neutrality.

boost_missed <- data.frame(twitter = full_data$twitter,
                     state = scores_df$state,
                     party = full_data$party_id,
                     pred = boost_pred,
                     prob = as.vector(predict(boost_tweet,
                      newdata = full_data,
                      n.trees = 1000,
                      type = "response")),
                     stringsAsFactors = FALSE) %>%
  filter(party != pred) %>%
  arrange(desc(prob))
kable(boost_missed)
twitter state party pred prob
SenAngusKing ME Independent Republican 0.9208607
RepGonzalez TX Democrat Republican 0.7459837
AnthonyBrownMD4 MD Democrat Republican 0.6341021
RepAlGreen TX Democrat Republican 0.5816371
repdavidscott GA Democrat Republican 0.5426301
RepJimCosta CA Democrat Republican 0.5236780
SenSanders VT Independent Democrat 0.2046912

We see that many of the same congresspeople get misclassified in the boosted tree as in the logistic models.

Random Forest

We attempted regular bagging as well, but that was computationally infeasible.
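Bagging is equivalent to a random forest that tries every predictor at each split (mtry = p), so with 4,345 word columns each split must scan all of them; by default, randomForest for classification samples only floor(sqrt(p)), roughly 65 here. A sketch of what the (infeasible) bagging fit would look like, with ntree chosen arbitrarily for illustration:

```r
library(randomForest)

# Bagging = random forest with mtry = p (all predictors considered at every split).
# With p = 4345 words, this per-split cost is what made bagging impractical for us.
x <- data.matrix(no_ind[ , -(1:2)])
bag_tweet <- randomForest(x, no_ind$party_id, mtry = ncol(x), ntree = 100)
```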

tweet_rf <- randomForest(data.matrix(no_ind[ ,-(1:2)]), no_ind$party_id, importance = TRUE)
rf_pred <- predict(tweet_rf, newdata = data.matrix(full_data[ ,-(1:2)]), type = "response")
rf_mcr <- mean(rf_pred != as.character(full_data$party_id))

Our random forest’s misclassification rate is 0.0059289. This is slightly higher than the lasso’s, but still a good bit better than the ridge model’s.

importance <- data.frame(tweet_rf$importance) %>%
  rownames_to_column() %>%
  arrange(desc(MeanDecreaseAccuracy))
kable(head(importance, 20))
rowname Democrat Republican MeanDecreaseAccuracy MeanDecreaseGini
#trumpcare 0.0127390 0.0468019 0.0306329 9.713332
#paymoreforless 0.0054547 0.0248032 0.0157184 5.129080
#aca 0.0005295 0.0242226 0.0130932 4.864742
#broadbandprivacy 0.0013899 0.0175384 0.0099967 3.892037
oppose 0.0025342 0.0143878 0.0088295 3.107782
#protectourcare 0.0007879 0.0156880 0.0087919 3.221491
coverage 0.0032383 0.0132550 0.0085739 4.243509
independent 0.0010066 0.0118849 0.0067837 2.808360
lose 0.0013026 0.0110005 0.0064029 3.225091
voting 0.0018554 0.0103140 0.0063605 4.000431
million 0.0007622 0.0097789 0.0056329 3.044651
cuts 0.0004832 0.0100751 0.0056108 2.070664
americans 0.0029419 0.0074259 0.0053520 3.337898
republicans 0.0009553 0.0086694 0.0050252 2.743931
aca 0.0012165 0.0077063 0.0047266 2.035161
wealthy 0.0000836 0.0083623 0.0044675 1.950740
@housegop 0.0003081 0.0077733 0.0042920 2.235260
seniors 0.0013829 0.0067745 0.0042918 2.214019
commission 0.0001714 0.0074168 0.0039913 1.874837
gop 0.0006458 0.0067122 0.0039435 2.574212

Many of the same words from earlier appear to have high variable importance in the random forest we fit. The most important words here also correspond to words that are most often used by Democrats, which is interesting.

rf_missed <- data.frame(twitter = full_data$twitter,
                        state = scores_df$state,
                        party = full_data$party_id,
                        pred = rf_pred,
                        prob = predict(tweet_rf,
                                                 newdata = data.matrix(full_data[ ,-(1:2)]),
                                                 type = "prob")[ ,2],
                     stringsAsFactors = FALSE) %>%
  filter(party != as.character(pred)) %>%
  arrange(desc(prob))
kable(rf_missed)
twitter state party pred prob
SenAngusKing ME Independent Republican 0.750
RepGonzalez TX Democrat Republican 0.564
SenSanders VT Independent Democrat 0.140

In addition to the two Independents, the forest misclassified Rep. Vicente González of Texas’ 15th Congressional District.

Discussion

To recap, our models’ misclassification rates were:

Naive (always Republican): 0.4683794
2-means clustering: 0.1146245
Ridge logistic: 0.0375494
Lasso logistic: 0.0039526
Boosted tree: 0.013834
Random forest: 0.0059289

Our first 3 PCs encoded party, chamber, and whether a user talked more about health care or the Russia investigation, respectively. Our clustering grouped first by party, then by chamber.

We can think about which members of Congress our non-naive models found it harder to classify.

missed <- data.frame(missed = unique(c(ridge_missed$twitter, lasso_missed$twitter, boost_missed$twitter, rf_missed$twitter))) %>%
  left_join(congress_df, by = c("missed" = "twitter"))
kable(missed)
missed last_name first_name chamber_type state party_id
repdavidscott Scott David rep GA Democrat
SenAngusKing King Angus sen ME Independent
RepGonzalez Gonzalez Vicente rep TX Democrat
RepAlGreen Green Al rep TX Democrat
RepSinema Sinema Kyrsten rep AZ Democrat
AnthonyBrownMD4 Brown Anthony rep MD Democrat
RepBetoORourke O’Rourke Beto rep TX Democrat
RepJimCosta Costa Jim rep CA Democrat
RepDerekKilmer Kilmer Derek rep WA Democrat
Sen_JoeManchin Manchin Joe sen WV Democrat
SenatorTester Tester Jon sen MT Democrat
RepOHalleran O’Halleran Tom rep AZ Democrat
SenDonnelly Donnelly Joe sen IN Democrat
RepStephMurphy Murphy Stephanie rep FL Democrat
MarkWarner Warner Mark sen VA Democrat
RepJoshG Gottheimer Josh rep NJ Democrat
SenatorHeitkamp Heitkamp Heidi sen ND Democrat
RepTomSuozzi Suozzi Thomas rep NY Democrat
SenSanders Sanders Bernard sen VT Independent

Again, we ignore Senator Sanders’ misclassification because he is considered farther to the left than the rest of the Democratic Party and caucuses with the Democrats. Many of the people misclassified are members of the Blue Dog Coalition, a House caucus of “fiscally responsible Democrats” who are traditionally more conservative than the party in general.

blue_dogs <- c("Costa", "Cuellar", "Lipinski", "Bishop", "Cooper", "Correa", "Crist", "Gonzalez", "Gottheimer", "Murphy", "O’Halleran", "Peterson", "Schneider", "Schrader", "Scott", "Sinema", "Thompson", "Vela")
kable(filter(missed, !(last_name %in% c(blue_dogs, "Sanders"))))
missed last_name first_name chamber_type state party_id
SenAngusKing King Angus sen ME Independent
RepAlGreen Green Al rep TX Democrat
AnthonyBrownMD4 Brown Anthony rep MD Democrat
RepBetoORourke O’Rourke Beto rep TX Democrat
RepDerekKilmer Kilmer Derek rep WA Democrat
Sen_JoeManchin Manchin Joe sen WV Democrat
SenatorTester Tester Jon sen MT Democrat
SenDonnelly Donnelly Joe sen IN Democrat
MarkWarner Warner Mark sen VA Democrat
SenatorHeitkamp Heitkamp Heidi sen ND Democrat
RepTomSuozzi Suozzi Thomas rep NY Democrat

Of the remaining members, many come from rural, southern, or typically Republican states, and our models may have picked up on some of the topics they tweet about that line up more with Republicans’ tweets. One thing of note is that every model misclassified Maine Senator Angus King as a Republican, even though he is an Independent who caucuses with the Democrats. Maine has a history of strong independent parties, and King is a former Democrat who left the party before running for governor (against Susan Collins, the other current senator from Maine). Upon leaving the party, King stated that “The Democratic Party as an institution has become too much the party that is looking for something from government,” indicating that he has some views sympathetic with Republicans (or at least dissimilar to Democrats).

In terms of variable importance, our models noted many of the same words that we saw in our exploratory data analysis. #trumpcare was almost always the most important variable, and important topics included health care, the Russia investigation, and net neutrality. Most of the words considered “important” were words used more often by Democrats, which is interesting. One reason for this might be that Democrats occupied a wider range of scores on PC1, indicating that their tweets were more dissimilar and thus harder to classify.

Ideas for Further Analysis

While our research question ultimately focused on inference, we originally set out to build a predictive model. Because politicians’ official Twitter accounts use such different words than “regular people”, however, we would have had to either collect and label tweets from ordinary Twitter users, or build a model that predicts party only for politicians.

We found logistical and ethical issues with the first option, and saw little point in building the second model, as almost all politicians list their party openly on their Twitter accounts. That second approach could, however, be used to predict how a nominally “non-partisan” elected official would actually act in practice, though different data would likely be required for that purpose.

Because we were only interested in inference, we didn’t need to worry about overfitting or model validation as much as we would for a predictive model. To verify our inferences, another data set could be collected and used to test our models. However, while such a test set would certainly contain different tweets from our training data, it would contain tweets from the same people, so the two data sets would not be independent. Because our model-fitting data set had a relatively low ratio of observations to predictors (< 1/4), we decided to use all the data we had collected rather than just a random sample of users.

References

Littman, Justin, 2017. “115th U.S. Congress Tweet Ids”, Harvard Dataverse, V1, http://dx.doi.org/10.7910/DVN/UIVHQR.

“congress-legislators” repository, GitHub organization “unitedstates”. https://theunitedstates.io/congress-legislators/legislators-current.csv.